{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<a href=\"https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv\" target=\"_blank\"><img align=\"left\" src=\"data/cover.jpg\" style=\"width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;\"></a>\n",
    "*This notebook contains an excerpt from the book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler.\n",
    "The code is released under the [MIT license](https://opensource.org/licenses/MIT),\n",
    "and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*\n",
    "\n",
    "*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.\n",
    "If you find this content useful, please consider supporting the work by\n",
    "[buying the book](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Representing Data and Engineering Features](04.00-Representing-Data-and-Engineering-Features.ipynb) | [Contents](../README.md) | [Reducing the Dimensionality of the Data](04.02-Reducing-the-Dimensionality-of-the-Data.ipynb) >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Preprocessing Data\n",
    "\n",
    "The more disciplined we are in handling our data, the better results we are likely to achieve\n",
    "in the end. The first step in this procedure is known as **data preprocessing**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Standardizing features\n",
    "\n",
    "Standardization refers to the process of scaling the data to have zero mean and unit\n",
    "variance. This is a common requirement for a wide range of machine learning algorithms,\n",
    "which might behave badly if individual features do not fulfill this requirement. We could\n",
    "manually standardize our data by subtracting from every data point the mean value ($\\mu$) of\n",
    "all the data, and dividing that by the variance ($\\sigma$) of the data; that is, for every feature $x$, we\n",
    "would compute $(x - \\mu) / \\sigma$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, scikit-learn offers a straightforward implementation of this process in its\n",
    "preprocessing module. Let's consider a 3 x 3 data matrix `X`, standing for three data points\n",
    "(rows) with three arbitrarily chosen feature values each (columns):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import preprocessing\n",
    "import numpy as np\n",
    "X = np.array([[ 1., -2.,  2.],\n",
    "              [ 3.,  0.,  0.],\n",
    "              [ 0.,  1., -1.]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, standardizing the data matrix `X` can be achieved with the function `scale`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.26726124, -1.33630621,  1.33630621],\n",
       "       [ 1.33630621,  0.26726124, -0.26726124],\n",
       "       [-1.06904497,  1.06904497, -1.06904497]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_scaled = preprocessing.scale(X)\n",
    "X_scaled"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make sure `X_scaled` is indeed standardized: zero mean, unit variance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  7.40148683e-17,   0.00000000e+00,   0.00000000e+00])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_scaled.mean(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, every row of the standardized feature matrix should have variance of 1 (which\n",
    "is the same as checking for a standard deviation of 1 using `std`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.,  1.,  1.])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_scaled.std(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Normalizing features\n",
    "\n",
    "Similar to standardization, **normalization** is the process of scaling individual samples to\n",
    "have unit norm. I'm sure you know that the norm stands for the **length of a vector**, and can\n",
    "be defined in different ways. We discussed two of them in the previous chapter: the L1\n",
    "norm (or Manhattan distance) and the L2 norm (or Euclidean distance).\n",
    "\n",
    "`X` can be normalized using the `normalize` function, and the L1 norm is specified by the `norm` keyword:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.2, -0.4,  0.4],\n",
       "       [ 1. ,  0. ,  0. ],\n",
       "       [ 0. ,  0.5, -0.5]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_normalized_l1 = preprocessing.normalize(X, norm='l1')\n",
    "X_normalized_l1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, the L2 norm can be computed by specifying `norm='l2'`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.33333333, -0.66666667,  0.66666667],\n",
       "       [ 1.        ,  0.        ,  0.        ],\n",
       "       [ 0.        ,  0.70710678, -0.70710678]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_normalized_l2 = preprocessing.normalize(X, norm='l2')\n",
    "X_normalized_l2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scaling features to a range\n",
    "\n",
    "An alternative to scaling features to zero mean and unit variance is to get features to lie\n",
    "between a given minimum and maximum value. Often these values are zero and one, so\n",
    "that the maximum absolute value of each feature is scaled to unit size. In scikit-learn, this\n",
    "can be achieved using `MinMaxScaler`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.33333333,  0.        ,  1.        ],\n",
       "       [ 1.        ,  0.66666667,  0.33333333],\n",
       "       [ 0.        ,  1.        ,  0.        ]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "min_max_scaler = preprocessing.MinMaxScaler()\n",
    "X_min_max = min_max_scaler.fit_transform(X)\n",
    "X_min_max"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the data will be scaled to fall within 0 and 1. We can specify different ranges by\n",
    "passing a keyword argument `feature_range` to the `MinMaxScaler` constructor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ -3.33333333, -10.        ,  10.        ],\n",
       "       [ 10.        ,   3.33333333,  -3.33333333],\n",
       "       [-10.        ,  10.        , -10.        ]])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-10, 10))\n",
    "X_min_max2 = min_max_scaler.fit_transform(X)\n",
    "X_min_max2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Binarizing features\n",
    "\n",
    "Finally, we might find ourselves not caring too much about the exact feature values of the\n",
    "data. Instead, we might just want to know if a feature is present or absent. **Binarizing** the\n",
    "data can be achieved by **thresholding** the feature values. Let's quickly remind ourselves of\n",
    "our feature matrix, `X`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1., -2.,  2.],\n",
       "       [ 3.,  0.,  0.],\n",
       "       [ 0.,  1., -1.]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's assume that these numbers represent the thousands of dollars in our bank accounts. If\n",
    "there are more than 0.5 thousand dollars in the account, we consider the person rich, which\n",
    "we represent with a 1. Else we put a 0. This is akin to thresholding the data with\n",
    "`threshold=0.5`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.,  0.,  1.],\n",
       "       [ 1.,  0.,  0.],\n",
       "       [ 0.,  1.,  0.]])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "binarizer = preprocessing.Binarizer(threshold=0.5)\n",
    "X_binarized = binarizer.transform(X)\n",
    "X_binarized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a matrix made entirely of ones and zeros."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handling missing data\n",
    "\n",
    "Another common need in feature engineering is the handling of missing data. For example,\n",
    "we might have a dataset that looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import nan\n",
    "X = np.array([[ nan, 0,   3  ],\n",
    "              [ 2,   9,  -8  ],\n",
    "              [ 1,   nan, 1  ],\n",
    "              [ 5,   2,   4  ],\n",
    "              [ 7,   6,  -3  ]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most machine learning algorithms cannot handle the **Not a Number (NAN)** values (`nan` in\n",
    "Python). Instead, we first have to replace all the nan values with some appropriate fill\n",
    "values. This is known as **imputation** of missing values.\n",
    "\n",
    "Three different strategies to impute missing values are offered by scikit-learn:\n",
    "- `'mean'`: Replaces all nan values with the mean value along a specified axis of the\n",
    "  matrix (default: axis=0).\n",
    "- `'median'`: Replaces all nan values with median value along a specified axis of\n",
    "  the matrix (default: axis=0).\n",
    "- `'most_frequent'`: Replaces all nan values with the most frequent value along a\n",
    "  specified axis of the matrix (default: axis=0).\n",
    "  \n",
    "For example, the `'mean'` imputer can be called as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 3.75,  0.  ,  3.  ],\n",
       "       [ 2.  ,  9.  , -8.  ],\n",
       "       [ 1.  ,  4.25,  1.  ],\n",
       "       [ 5.  ,  2.  ,  4.  ],\n",
       "       [ 7.  ,  6.  , -3.  ]])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import Imputer\n",
    "imp = Imputer(strategy='mean')\n",
    "X2 = imp.fit_transform(X)\n",
    "X2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's verify the math by calculating the mean by hand, should evaluate to 3.75 (same as `X2[0, 0]`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3.75, 3.75)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(X[1:, 0]), X2[0, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, the `'median'` strategy relies on the same code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 3.5,  0. ,  3. ],\n",
       "       [ 2. ,  9. , -8. ],\n",
       "       [ 1. ,  4. ,  1. ],\n",
       "       [ 5. ,  2. ,  4. ],\n",
       "       [ 7. ,  6. , -3. ]])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "imp = Imputer(strategy='median')\n",
    "X3 = imp.fit_transform(X)\n",
    "X3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make sure the median of the column evaluates to 3.5 (same as `X3[0, 0]`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3.5, 3.5)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.median(X[1:, 0]), X3[0, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Representing Data and Engineering Features](04.00-Representing-Data-and-Engineering-Features.ipynb) | [Contents](../README.md) | [Reducing the Dimensionality of the Data](04.02-Reducing-the-Dimensionality-of-the-Data.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
